Return-Path: <icon-group-sender>
Received: (from root@localhost)
    by baskerville.CS.Arizona.EDU (8.9.1a/8.9.1) id UAA10001
    for icon-group-addresses; Mon, 1 May 2000 20:58:42 -0700 (MST)
Message-Id: <200005020358.UAA10001@baskerville.CS.Arizona.EDU>
Date: Fri, 28 Apr 2000 12:31:31 +1200 (NZST)
From: "Richard A. O'Keefe" <ok@atlas.otago.ac.nz>
To: NOSPAM.Frank.Lhota@lexma.meitech.com, icon-group@optima.CS.Arizona.EDU
Subject: Re: Is Anyone Working On A Unicode Version Of Icon?
Errors-To: icon-group-errors@optima.CS.Arizona.EDU
Status: RO

"Frank J. Lhota" <NOSPAM.Frank.Lhota@lexma.meitech.com> wrote:
> We may, unfortunately, be at a point where this is too late to change the
> predefined csets. ... The right fix might be to add new predefined csets,
> so that new code could be properly internationalized.

May I suggest instead new *functions* for getting csets?

    lower_case_cset(Locale) -> Cset
    upper_case_cset(Locale) -> Cset
    title_case_cset(Locale) -> Cset
    letter_cset(Locale) -> Cset

and so on, where a Locale could be a string such as "en_NZ".

> If we were to use the packed array of bits approach for wide characters, a
> cset would be represented by 2048 words -- Yikes! Clearly, we would need to
> use a different internal representation.

It's worse than that.  One of the core design principles of Unicode was
that it coded characters, not languages; language tagging was supposed
to be left to a higher-level protocol.  Well, that's not true in
Unicode 3.0 any more.  There are language markup tags in Plane 14.
Now 2048 words is enough for one plane...  Plane 0 is now pretty much
full, and useful stuff is going in Plane 1, so the age of 16-bit
characters didn't really last very long.

> I would suggest representing a cset as an ordered list of character
> intervals. ... Most csets that actually occur in
> applications, however, could be represented by a handful of intervals.

Not unless you have rather large hands.
I just ran a little AWK script against unidata2.txt (I don't have the
Unicode 3.0 version yet), and found

    DIGIT            22 intervals
    LETTER          179 intervals
    CAPITAL LETTER  337 intervals
    SMALL LETTER    349 intervals

I suspect that letter csets will be wanted rather often.

Perhaps a hybrid representation might be better:
 - a bitmap for the first 256 characters of Unicode
 - a "window pointer" selecting another block of 256 characters,
   so that the current locale (Cyrillic, or Greek, or Arabic, or
   whatever) can be accessed fairly fast
 - a bitmap for the characters in the selected window
 - an ordered list of intervals for the remaining characters
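
The hybrid representation listed above could be sketched roughly as
follows.  This is only an illustration in Python, not Icon's actual cset
implementation: the names HybridCset, to_intervals, and window_base are
hypothetical, and the default window (the block starting at U+0400,
which holds Cyrillic) is an arbitrary choice.

```python
from bisect import bisect_right

BLOCK = 256  # number of characters covered by each bitmap


def to_intervals(code_points):
    """Collapse code points into a sorted list of inclusive (lo, hi) runs."""
    runs = []
    for cp in sorted(set(code_points)):
        if runs and cp == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], cp)   # extend the current run
        else:
            runs.append((cp, cp))          # start a new run
    return runs


class HybridCset:
    """Hypothetical cset: two 256-bit bitmaps plus an interval list."""

    def __init__(self, code_points, window_base=0x0400):
        # window_base plays the role of the "window pointer"; it must be
        # block-aligned and must not overlap the first 256 characters.
        if window_base % BLOCK != 0 or window_base < BLOCK:
            raise ValueError("window_base must be a multiple of 256, >= 256")
        self.window_base = window_base
        self.low_bits = 0      # bitmap for code points 0..255
        self.window_bits = 0   # bitmap for the selected window
        rest = []
        for cp in code_points:
            if cp < BLOCK:
                self.low_bits |= 1 << cp
            elif window_base <= cp < window_base + BLOCK:
                self.window_bits |= 1 << (cp - window_base)
            else:
                rest.append(cp)
        # Everything outside the two bitmaps goes into ordered intervals.
        self.intervals = to_intervals(rest)
        self._starts = [lo for lo, _ in self.intervals]

    def __contains__(self, ch):
        cp = ord(ch)
        if cp < BLOCK:                       # fast path: first 256 characters
            return bool((self.low_bits >> cp) & 1)
        offset = cp - self.window_base
        if 0 <= offset < BLOCK:              # fast path: current window
            return bool((self.window_bits >> offset) & 1)
        # Slow path: binary search for the last interval starting at or
        # before cp, then check that cp falls inside it.
        i = bisect_right(self._starts, cp) - 1
        return i >= 0 and cp <= self.intervals[i][1]
```

With such a class, a test like `"a" in cs` or a test on a character in
the selected window costs one shift and one mask, while characters
outside both bitmaps cost a binary search over the interval list.  For
comparison, a flat bitmap over one 65,536-character plane would need
65536 / 32 = 2048 32-bit words, which is the figure quoted above.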